Abstract

We extend our study of phase transitions in the generalization behaviour of multilayer
perceptrons with non-overlapping receptive fields to the influence of noise. The noise may
act on the input units and/or on the couplings between the input units and the hidden
units of the second layer ('input noise'), or on the final output unit ('output noise'). Without
output noise, the output itself is given by a general, permutation-invariant Boolean function
of the outputs of the hidden units. We find that the phase transitions found in the
deterministic case mostly persist in the presence of noise. The influence of the
noise on the position of the phase transition, as well as on the behaviour in other regimes
of the loading parameter $\alpha$, can often be described by a simple rescaling of $\alpha$ that depends on
the strength and type of the noise. We then consider the problem of the optimal noise level for
Gibbsian and Bayesian learning, also taking replica symmetry breaking into account. Finally we
consider the question of why learning with errors is useful at all.

1 Introduction and overview 
1.1 Introduction and basic definitions 

In a recent paper [1], one of us has treated the problem of phase transitions in the
generalization behaviour of two-layer neural networks with non-overlapping receptive
fields. The architecture of the systems considered is shown in Fig. 1. It corresponds to a
tree with a total of $N$ input units, which are grouped into $K$ vectors
$\boldsymbol{\xi}_1, \dots, \boldsymbol{\xi}_K$ of $M := N/K$
binary components $\xi_{k,m} = \pm 1$, with $k = 1, \dots, K$ and $m = 1, \dots, M$.

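Each one of these vectors $\boldsymbol{\xi}_k$ determines the binary output $\sigma_k$ of a so-called "hidden
unit" according to the perceptron rule

$$\sigma_k = \mathrm{sgn}\left(\mathbf{w}_k \cdot \boldsymbol{\xi}_k\right) , \qquad (1)$$
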
*based on the PhD thesis of B. Schottky, Regensburg 1996 
where the so-called coupling vectors $\mathbf{w}_k$ have $M$ arbitrary real components $w_{k,m}$, which
are only constrained by the normalization $\mathbf{w}_k^2 = M$.

The final output $\sigma$ (= 'classification', 'answer') of the machine for a given input
(= 'question') results from a fixed Boolean function

$$\sigma = B(\{\sigma_k\}) \qquad (2)$$

of the outputs $\sigma_k$ of the hidden units. This Boolean function is arbitrary apart from the
postulate that it should be invariant under a permutation of its arguments.

Now the task of this classification machine is to learn a certain "rule" by modification
of the coupling vectors through learning the correct classification of a set of input examples
$\boldsymbol{\xi}^\mu = (\boldsymbol{\xi}_1^\mu, \dots, \boldsymbol{\xi}_K^\mu)$, with $\mu = 1, \dots, p$. Here it is assumed that the so-called loading
parameter $\alpha := p/N$ is finite, while the thermodynamic limit $N \to \infty$ is taken.

In the following it is also assumed that the "rule", by which the correct answers follow
from the questions, is implemented by a "teacher perceptron" of the same architecture
as given above, with fixed "teacher couplings" $\mathbf{w}_k^t$. In particular we assume that the
Boolean function of the student machine is the same as that of the teacher. However,
the noise levels can be different, unless otherwise stated (see below).

We consider the generalization ability $g(\alpha)$, see [0, p|, of the system after a training
process with $p = \alpha \cdot N$ examples; $g(\alpha)$ is defined as the probability that after the training
an additional random question is answered correctly, i.e. in the same way as the teacher
would answer in the absence of noise. It should be stressed that after the training we
switch off any noise, both for the teacher and for the student machine. In contrast,
during the training, noise of various kinds will corrupt both the student and the teacher
behaviour (see below).

Of course $g(\alpha)$ generally depends not only on $\alpha$, but also on the architectures
considered, i.e. on the Boolean function $B(\{\sigma_k\})$, and on the noise. Only in the limit
$\alpha \to \infty$, as already shown in [1], in the absence of noise the architecture does not matter,
and one obtains for $\alpha \to \infty$ asymptotically the universal result

$$1 - g(\alpha) \to \frac{0.625}{\alpha} \qquad (3)$$

for the so-called Gibbsian learning (see below), where a student is drawn randomly from
an ensemble which consists in the deterministic case just of those students classifying
the training set correctly.

This asymptotic result is independent of the choice of the Boolean function and the 
number K of hidden units. Although, as already mentioned, our Boolean functions are 
quite general, apart from the constraint of permutation invariance, and although the 
behaviour depends essentially only on a small set of characteristic numbers (see below), 
we mention for the following that the most important machines considered are 

• the committee machine: This machine classifies by a majority vote of the $K := 2n + 1$
hidden units, i.e. with

$$\sigma = \mathrm{sgn}\Big(\sum_{k=1}^{K} \sigma_k\Big) , \qquad (4)$$

• whereas the parity machine is defined for general $K \ge 2$ by

$$\sigma = \prod_{k=1}^{K} \sigma_k , \qquad (5)$$

• and finally the AND-machine by

$$\sigma = \mathrm{sgn}\Big(\sum_{k=1}^{K} \sigma_k - K + 1\Big) , \qquad (6)$$

i.e. a positive classification is only given if all hidden units vote positively.
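
The three output functions of Eqs. (4)-(6) can be written down directly. The following is a minimal sketch; the function names and the helper `is_permutation_invariant` are illustrative choices, not part of the original formalism. Hidden-unit outputs are passed as tuples of +1 and -1.

```python
from itertools import permutations, product

def committee(hidden):
    """Majority vote of the hidden units (K odd, so ties cannot occur)."""
    return 1 if sum(hidden) > 0 else -1

def parity(hidden):
    """Product of the hidden-unit outputs."""
    result = 1
    for s in hidden:
        result *= s
    return result

def and_machine(hidden):
    """Positive output only if every hidden unit votes +1."""
    return 1 if sum(hidden) - len(hidden) + 1 > 0 else -1

def is_permutation_invariant(boolean_fn, K):
    """Check B(s) == B(permuted s) for all 2**K internal representations."""
    return all(
        boolean_fn(s) == boolean_fn(p)
        for s in product((1, -1), repeat=K)
        for p in permutations(s)
    )
```

All three functions depend on the hidden outputs only through the number of positive votes, which is why the permutation check succeeds for each of them.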

The main result of the present paper concerns the possible existence of phase transitions
in the generalization behaviour as a function of $\alpha$. E.g. for the parity machine, in
contrast to the committee, generalization starts only if $\alpha$ is larger than a critical value,
[0- B ("Aha effect"). In the preceding paper, this was discussed for the deterministic
case, whereas in the present second part we discuss the influence of noise. Generally,
we find that the phase transitions mostly persist, although with changed critical values,
and we also find certain scaling laws combining the critical loading parameter $\alpha$ and the
"noise strength". Furthermore, the performance of the system is found to be optimal
when the noise of the "student machine" adapts to that of the teacher in a certain way.

1.2 Overview 

Since there are a lot of categories considered in this paper, it is easy to lose track. So 
we give a brief overview for better orientation. The categories considered are: 

• Two types of noise, input- and output-noise, see subsections 2.5 and 2.6 below. 

• We consider (i) the case that noise-levels of teacher and student are assumed to 
be the same (chapter 3), but also the case (ii) that the student noise level can be 
chosen to optimize the learning behaviour (chapter 4). 

• Although most results are discussed for general values of the reduced size $\alpha := p/N$
of the training set, the two limits $\alpha \to 0$ (more precisely: '$\alpha$ small', see
subsection 3.2.1) and $\alpha \to \infty$ are of special interest.

• Concerning the noise strength, particular emphasis is put on the two limits of small 
and large noise level, respectively. 

• Our main results are for general two-layer perceptrons with K "hidden units", 
where K > 1; however, emphasis is sometimes put on the case K = 1, i.e. the 
single-layer perceptron. 

• There are different learning rules (section 2.2 below), and for K > 1 one has to 
distinguish the different Boolean output functions (section 2.3). 

• Our main results are obtained with the replica-symmetric approach (see below), but
we also discuss some results obtained with broken replica symmetry (see chapter 4).

Since in principle all these categories can be combined arbitrarily, there is a very large
number of combinations, and not all of them are considered in this paper. We also note
that some considerations are made only for the simple perceptron ($K = 1$).
The paper is now organized as follows: 

• Section 2 outlines the theoretical framework used in this paper, describes the learning
algorithms considered and introduces the two types of noise treated in this paper.

• The whole of section 3 deals with the case (i) mentioned above, i.e. there it is
assumed that the noise level for the student is chosen to be the same as that of
the teacher (which is not a bad choice). Both types of noise are considered, with
special emphasis on the limiting cases $\alpha \to 0$ and $\alpha \to \infty$ of the loading parameter.

• In section 4 we investigate the impact of a varying student noise level at fixed 
teacher noise, i.e. case (ii), aiming at the optimal choice. This is done only for the 
case of output noise. Furthermore, here we concentrate mainly on the single-layer 
perceptron, taking 'replica symmetry breaking' into account. The multilayer case 
is treated only for the limiting cases of large and small training sets. 

• Section 5 deals with the question, why a finite student noise level (which means 
that some of the training patterns are not learned correctly) can be useful at all. 

• Finally section 6 presents our conclusions.

2 Basic theory

The answering behaviour of a student and a teacher network for given weights is defined
by the two functions

$$\varphi_s(\sigma^\mu | \mathbf{w}^s, \boldsymbol{\xi}^\mu) \quad \text{and} \quad \varphi_t(\sigma^\mu | \mathbf{w}^t, \boldsymbol{\xi}^\mu) , \qquad (7)$$

determining the probability of getting the final answer $\sigma^\mu$ on a question $\boldsymbol{\xi}^\mu$ if the weights
$\mathbf{w}$ are given. The sub-/superscripts 's' and 't' stand for student and teacher, respectively.
So $\varphi$ encodes both the underlying architecture and the noise process corrupting the
answer.

2.1 The version space and free energy 

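From $\varphi_t$ one derives the probability $P_t$ of getting answers $\{\sigma^\mu\}$ for patterns $\boldsymbol{\xi}^\mu$ with
$\mu = 1, \dots, p$, by the teacher rule:

$$P_t(\sigma^p) = \int d\mathbf{w}^t \, P_w(\mathbf{w}^t) \prod_{\mu=1}^{p} \varphi_t(\sigma^\mu | \mathbf{w}^t, \boldsymbol{\xi}^\mu) , \qquad (8)$$

where the so-called prior $P_w$ takes care of the normalization constraints. Here and in
the following we use $\sigma^p$ as notation for the set $\{\sigma^\mu\}$ of answers, with $\mu = 1, \dots, p$.
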
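Using Bayes' theorem, the probability that a specific weight vector $\mathbf{w}^s$ is the
correct one, given the training patterns and the answers by the teacher, is determined
through $\varphi_s$ by

$$P(\mathbf{w}^s | \sigma^p) = \frac{P_w(\mathbf{w}^s) \, p(\sigma^p | \mathbf{w}^s)}{\int d\mathbf{w}^s \, P_w(\mathbf{w}^s) \, p(\sigma^p | \mathbf{w}^s)} , \qquad (9)$$

with

$$p(\sigma^p | \mathbf{w}^s) = \prod_{\mu=1}^{p} \varphi_s(\sigma^\mu | \mathbf{w}^s; \boldsymbol{\xi}^\mu) . \qquad (10)$$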



This defines the degree of membership of a specific coupling vector $\mathbf{w}^s$ to the so-called
version space. The version space contains all student coupling vectors with a weight
proportional to the probability that these couplings agree with those of the actual teacher.
So one defines the corresponding partition function

$$Z(\sigma^p) = \int d\mathbf{w}^s \, P_w(\mathbf{w}^s) \prod_{\mu=1}^{p} \Phi_s(\sigma^\mu | \mathbf{w}^s, \boldsymbol{\xi}^\mu) , \qquad (11)$$

where

$$\Phi_s(\sigma | \mathbf{w}^s, \boldsymbol{\xi}) = c_E \, \varphi_s(\sigma | \mathbf{w}^s, \boldsymbol{\xi}) , \qquad (12)$$

with $c_E$ being a positive free constant, which takes into account that the degree of
membership need not be normalized.

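In our case we restrict the possible functions $\varphi_{s/t}$ of the student resp. the teacher to
depend only on the corresponding local field values $h_k^{s/t}$ of the hidden nodes,

$$h_k^{s/t} := \frac{1}{\sqrt{M}} \, \mathbf{w}_k^{s/t} \cdot \boldsymbol{\xi}_k , \qquad (13)$$

so

$$\varphi_{s/t}(\sigma | \mathbf{w}^{s/t}, \boldsymbol{\xi}) = \varphi_{s/t}(\sigma | \{h_k^{s/t}\}) . \qquad (14)$$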

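We can now perform the Gardner analysis, J?| , by calculating

$$\mathcal{F} := \lim_{N \to \infty} \frac{1}{N} \left\langle \ln Z(\sigma^p) \right\rangle , \qquad (15)$$

which we will call 'free energy' although this is physically not precise.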

To describe the structure of the version space in the thermodynamic limit $N \to \infty$
we introduce the order parameters

$$q_k := \frac{1}{M} \, \mathbf{w}_k^{s,a} \cdot \mathbf{w}_k^{s,b} , \qquad (16)$$

$$r_k := \frac{1}{M} \, \mathbf{w}_k^{s} \cdot \mathbf{w}_k^{t} . \qquad (17)$$

So $q_k$ is the overlap between the $k$-th subperceptrons of two students $a$ and $b$ chosen randomly from
the version space, and $r_k$ is the corresponding overlap of a random student vector with
the teacher couplings. Since $B$ is restricted to be permutation symmetric,
these quantities cannot depend on the node number $k$, and thus it is sufficient to use



$$q = q_k , \qquad r = r_k \qquad (18)$$

as permutation-symmetric order parameters. In the replica-symmetric approximation,
straightforward calculations, see [1], lead to the final result



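$$\mathcal{F} = \mathrm{extr}_{(q,r)} \left[ \frac{1}{2} \ln(1-q) + \frac{1}{2} \, \frac{q - r^2}{1-q} - \alpha \, W(q,r) \right] \qquad (19)$$

with the so-called energy term

$$W(q,r) = - \int \prod_{k=1}^{K} Dt_k \sum_{\sigma=\pm 1} F_t(\sigma, \{q, r, t_k\}) \cdot \ln F_s(\sigma, \{q, t_k\}) \qquad (20)$$

and the architecture specification for the teacher machine

$$F_t(\sigma, \{q, r, t_k\}) = \int \prod_{k=1}^{K} Ds_k \, \varphi_t\left(\sigma, \left\{ s_k \sqrt{1 - \frac{r^2}{q}} - \frac{r}{\sqrt{q}} \, t_k \right\}\right) \qquad (21)$$

and for the student machine

$$F_s(\sigma, \{q, t_k\}) = \int \prod_{k=1}^{K} Ds_k \, \Phi_s\left(\sigma, \left\{ s_k \sqrt{1-q} - t_k \sqrt{q} \right\}\right) . \qquad (22)$$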

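Here $Dx = (2\pi)^{-1/2} \, dx \, \exp(-x^2/2)$ is the Gauss measure, and the values of the order
parameters $q$, $r$ are fixed by the saddle-point conditions

$$\frac{\partial \mathcal{F}}{\partial q} = \frac{\partial \mathcal{F}}{\partial r} = 0 . \qquad (23)$$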



These are very general formulas which allow one to calculate the order parameters for
classification machines with tree architecture. For non-permutation-symmetric
Boolean functions $B$, however, one would have to distinguish between the $q_k$, $r_k$ for different nodes.



2.2 Gibbs and Bayes algorithms; Gardner analysis 

As training algorithms we discuss (i) the Gibbs algorithm and (ii) the Bayes algorithm.
They are distinguished by the way in which the version space is utilized: For the Gibbs
algorithm, a "typical student machine" is drawn at random out of the version space
according to the weight factor (11), and then an average is performed as usual. In contrast,
for the Bayes algorithm one takes into account all members of the version space and
gives the answer which corresponds to their weighted majority vote. In this way, the
a posteriori error probability is minimized. As a consequence, the answers given by this
so-called Bayes procedure usually cannot be obtained from only one machine of the kind
considered.
2.3 Notation to encode the Boolean function

For a specific Boolean function $B$ (with a given number $K$ of hidden units) we define
the following expression:

$$\Delta^\sigma(\{\sigma_k\}) = \begin{cases} 1 & \text{for } B(\{\sigma_k\}) = \sigma , \\ 0 & \text{otherwise.} \end{cases} \qquad (24)$$

Thus $\Delta^\sigma$ is 1 just for those internal representations which are mapped to $\sigma$ by the
Boolean function $B$. We recall that only learnable problems are considered, so the
same Boolean function specifies the architecture for the teacher and student networks,
and thus we need no superscript to distinguish between them.

We also need a short code to denote a specific architecture. We use the same convention
as already introduced in [1]. A Boolean function is characterized by its number of
nodes $K$ and a special 'mcode' $q$ with $q = \sum_{\nu=0}^{K-1} n_\nu 2^\nu$. Here $n_\nu = 0$ or $n_\nu = 1$, respectively,
if a positive vote of exactly $\nu$ hidden units leads to a negative resp. positive final output
$\sigma$ of the Boolean function. (By convention, $\nu = K$ shall always imply a positive output.)
The name is then combined to 'K$K$_mcode$q$', so for example 'K4_mcode2' is a network
with 4 hidden units and positive output if exactly 4 or 1 hidden unit(s) have positive
vote.
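
The naming convention can be made concrete with a small decoder. This is only a sketch: the function name `boolean_from_mcode` is hypothetical, and it assumes that the bits $n_\nu$ of the mcode run over $\nu = 0, \dots, K-1$, with $\nu = K$ fixed to a positive output by the convention stated above.

```python
def boolean_from_mcode(K, mcode):
    """Build B({sigma_k}) from the number of hidden units K and the mcode.

    Bit v of `mcode` (assumed range v = 0..K-1) says whether exactly v
    positive hidden votes give a positive final output.
    """
    n = [(mcode >> v) & 1 for v in range(K)]
    n.append(1)  # convention: v = K positive votes always give output +1

    def B(hidden):
        positive_votes = sum(1 for s in hidden if s == 1)
        return 1 if n[positive_votes] else -1

    return B
```

With this reading, `boolean_from_mcode(4, 2)` reproduces the 'K4_mcode2' example (positive output for exactly 4 or exactly 1 positive votes), and the $K = 3$ committee machine corresponds to mcode 4.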



2.4 Error probability 



The order parameters describe the learning success (or failure) quite well, and we will
often present just these values. Nevertheless the quantity of final
interest is the generalization ability. In the presence of noise, there are of course
different possibilities to define this quantity; our choice is, as stated already above, to
assume that after training the noise is switched off completely, both for the teacher
and for the student networks. Moreover, we refer rather to the generalization error
$\varepsilon(\alpha) := 1 - g(\alpha)$, measuring the probability that student and teacher disagree on a new
question.

For a given architecture this generalization error is determined uniquely by the values
of the order parameters $q(\alpha)$ and $r(\alpha)$, no matter which noise processes have influenced
the learning.

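For the Gibbs algorithm $\varepsilon$ depends only on the typical student-teacher overlap $r$ and
is given by (see also Eq. (21) in Ref. [1]):

$$\varepsilon_{\rm Gibbs}(r) = \sum_{\sigma=\pm 1} \mathrm{Tr}_{\{\sigma_k\}} \mathrm{Tr}_{\{\sigma'_k\}} \, \Delta^\sigma(\{\sigma_k\}) \, \Delta^{-\sigma}(\{\sigma'_k\}) \prod_{k=1}^{K} \frac{1}{2} \left[ 1 - \frac{1}{\pi} \arccos(\sigma_k \sigma'_k r) \right] . \qquad (25)$$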

For the Bayes algorithm we obtain the generalization error by

$$\varepsilon_{\rm Bayes} = \int \prod_{k=1}^{K} Dt_k \, \min\left[ \mathrm{Tr}_{\{\sigma_k\}} \Delta^{1}(\{\sigma_k\}) \prod_{k} H(\sigma_k \gamma t_k) , \; \mathrm{Tr}_{\{\sigma_k\}} \Delta^{-1}(\{\sigma_k\}) \prod_{k} H(\sigma_k \gamma t_k) \right] , \qquad (26)$$

where the "min" in Eq. (26) means that the error probability corresponds to the smaller
fraction of the version space, which belongs to the minority votes. Note that in (26) the
values $q(\alpha)$ and $r(\alpha)$ of both order parameters are important.

For the simple perceptron, and for the parity machine in general, there is a close
relationship between (25) and (26); we will return to this point later.
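
The Gibbs error can be evaluated by brute-force enumeration for small $K$. The sketch below assumes only the standard fact that two hidden units with overlap $r$ disagree with probability $\arccos(r)/\pi$; the name `gibbs_error` and the representation of $B$ as a Python callable are illustrative choices.

```python
import math
from itertools import product

def gibbs_error(B, K, r):
    """Probability that teacher and student (overlap r per hidden unit)
    give different final outputs on a random question."""
    theta = math.acos(r) / math.pi  # disagreement probability of one hidden unit
    error = 0.0
    for teacher in product((1, -1), repeat=K):
        for student in product((1, -1), repeat=K):
            if B(teacher) == B(student):
                continue
            prob = 1.0
            for s_t, s_s in zip(teacher, student):
                # factor 1/2 for the teacher's hidden output, times the
                # conditional probability of the student's hidden output
                prob *= 0.5 * ((1.0 - theta) if s_t == s_s else theta)
            error += prob
    return error
```

For the simple perceptron this reduces to $\arccos(r)/\pi$, and for the $K = 2$ parity machine to $2\theta(1-\theta)$ with $\theta = \arccos(r)/\pi$, i.e. the probability of an odd number of hidden-unit disagreements.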
2.5 Output noise

We have investigated the influence of two types of noise. The first one, called 'output
noise', flips the final output with the probability

$$p_f = \frac{1}{1 + e^{\beta}} . \qquad (27)$$

This corresponds to

$$\varphi(\sigma, \{h_k\}) = \frac{\exp\left(-\beta \cdot \delta[-\sigma, B(\{\mathrm{sgn}(h_k)\})]\right)}{1 + \exp(-\beta)} \qquad (28)$$

as the probability of getting the output $\sigma$ for given fields $\{h_k\}$ at the hidden units. In Eq. (28),
we have written $\delta[i, k]$ for the Kronecker symbol (= 1 if $i = k$, = 0 for $i \ne k$), and
$\beta^{-1}$ is the parameter characterizing the noise strength.
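
As a quick consistency check of the output-noise model, the following sketch implements the flip probability and the answer probability under the reading $p_f = 1/(1+e^{\beta})$; the function names are hypothetical. It also makes explicit the identity $1 - 2p_f = \tanh(\beta/2)$, which is used in the rescaling of $\alpha$ later on.

```python
import math

def flip_probability(beta):
    """Probability that output noise of strength 1/beta flips the final answer."""
    return 1.0 / (1.0 + math.exp(beta))

def phi_output_noise(sigma, clean_output, beta):
    """Probability of answering `sigma` when the noise-free answer is `clean_output`."""
    delta = 1.0 if sigma == -clean_output else 0.0  # Kronecker delta[-sigma, B]
    return math.exp(-beta * delta) / (1.0 + math.exp(-beta))
```

For $\beta \to \infty$ the flip probability vanishes (deterministic case); for $\beta \to 0$ it approaches 1/2, so the answers carry no information.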



2.6 Input noise

The second type of noise, called 'input noise', causes a noise perturbation of the local
fields at the hidden units, $h_k \to h_k + \eta$, where $\eta$ is a random variable with the Gaussian
distribution $p(\eta) = (2\pi\gamma)^{-1/2} \exp(-\eta^2/(2\gamma))$. Similarly to the case of output noise, it is
natural to define the noise strength $\beta^{-1} := \gamma$.

The origin of this noise can be that the input pattern itself is corrupted
by noise or that there is weight noise in the couplings of the teacher.

The flip probability $p_f(\gamma)$ depends here on the architecture, namely it is

$$p_f(\gamma) = \varepsilon\left(\sqrt{(1+\gamma)^{-1}}\right) , \qquad (29)$$

with the generalization error $\varepsilon(r)$ already known from Eq. (25).
The probability $\varphi$ to answer on $\{h_k\}$ with $\sigma$ is

$$\varphi(\sigma, \{h_k\}) = \mathrm{Tr}_{\{\sigma_k\}} \Delta^\sigma(\{\sigma_k\}) \prod_{k=1}^{K} H(-\sigma_k h_k \sqrt{\beta}) , \qquad (30)$$

where $H(x) := \int_x^\infty du \, \exp(-u^2/2)/\sqrt{2\pi}$.

Eq. (29) can be seen as follows: The generalization error $\varepsilon(r)$ defines the probability
that student and teacher machine, which have overlap $r$, give a different answer on a
question. The local fields at a node $k$ of the respective machines can be written as $h^t = t$,
$h^s = t \cdot r + v \cdot \sqrt{1 - r^2}$, with two independent, normally distributed random variables $t$
and $v$. The flip probability, for comparison, defines the probability that the local field
$h^0$, by adding noise with average 0 and variance $\gamma$, is changed to $h^1$, where $h^0 = t$ and
$h^1 = t + v \sqrt{\gamma}$, such that the final answers differ. Thus the problems are completely
analogous, with the exact correspondences

$$\frac{\sqrt{1 - r^2}}{r} = \sqrt{\gamma} , \quad \text{or} \quad r = \sqrt{\frac{1}{1+\gamma}} . \qquad (31)$$
With Eq. (29) one can easily discuss the small-noise limit in this case, since for $r \to 1$,
according to Eq. (33) in [1], one gets

$$\varepsilon(r) \to n_c \, \pi^{-1} \sqrt{2(1-r)} , \qquad (32)$$

which follows e.g. from Eq. (25) for $r \to 1$; thus one obtains

$$p_f(\gamma \to 0) \to n_c \, \pi^{-1} \sqrt{\gamma} . \qquad (33)$$

Here $n_c$ is, in the limit considered, the only architecture-dependent value determining
the asymptotics for $\gamma \to 0$. It characterizes the 'border regime' of the Boolean function
$B(\{\sigma_k\})$, namely by

$$n_c = \left(\frac{1}{2}\right)^{K} N_c , \qquad (34)$$

where $N_c$ is the number of all those possible $K \cdot 2^K$ bit flips of the outputs of the hidden
units which would lead to a change in the final output, see Eq. (35) in [1].
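
The border number $n_c$ can be counted directly for small $K$. The sketch below enumerates all $K \cdot 2^K$ single bit flips; it also evaluates the input-noise flip probability for the simple perceptron, where $\varepsilon(r) = \arccos(r)/\pi$, and its small-noise approximation $n_c \sqrt{\gamma}/\pi$. All function names are illustrative, and the closed form is used only for $K = 1$.

```python
import math
from itertools import product

def border_number(B, K):
    """n_c = N_c / 2**K, where N_c counts single hidden-unit flips
    that change the final output of the Boolean function B."""
    changes = 0
    for hidden in product((1, -1), repeat=K):
        for k in range(K):
            flipped = hidden[:k] + (-hidden[k],) + hidden[k + 1:]
            if B(flipped) != B(hidden):
                changes += 1
    return changes / 2 ** K

def perceptron_flip_probability(gamma):
    """Input-noise flip probability for K = 1: epsilon(r) at r = 1/sqrt(1+gamma)."""
    r = 1.0 / math.sqrt(1.0 + gamma)
    return math.acos(r) / math.pi

def small_noise_flip_probability(n_c, gamma):
    """Leading-order flip probability for weak input noise."""
    return n_c * math.sqrt(gamma) / math.pi
```

For the parity machine every flip changes the output, so $n_c = K$; for the $K = 3$ committee only the 2:1 configurations are sensitive, giving $n_c = 3/2$.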



3 Teacher and student machine have identical noise-levels 

In the following we assume at first $\varphi_s = \varphi_t$, which means that the student machine uses
the known noise level of the teacher machine. This assumption, which implies $r = q$, is
natural, since in this way overfitting will be avoided; moreover, as we will see later, it is
not too far from the optimal choice.



3.1 Free energy 

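After the preparations in the last section we can calculate the free energy for both noise
types. One gets from Eqs. (19) and (20)

$$\mathcal{F} = \mathrm{extr}_{(q)} \left\{ \frac{1}{2} \ln(1-q) + \frac{q}{2} - \alpha \cdot W(q) \right\} , \qquad (35)$$

where for the case of output noise

$$W(q) = -\frac{1}{1 + e^{-\beta}} \int \prod_{k} Dt_k \sum_{\sigma=\pm 1} \mathrm{Tr}_{\{\sigma_k\}} \tilde{\Delta}^\sigma(\{\sigma_k\}) \prod_{k} H(\sigma_k \gamma t_k) \times \ln\left[ \mathrm{Tr}_{\{\sigma_k\}} \tilde{\Delta}^\sigma(\{\sigma_k\}) \prod_{k} H(\sigma_k \gamma t_k) \right] \qquad (36)$$

with $\gamma = \sqrt{q/(1-q)}$ and

$$\tilde{\Delta}^\sigma(\{\sigma_k\}) := \Delta^\sigma(\{\sigma_k\}) + \exp(-\beta) \, \Delta^{-\sigma}(\{\sigma_k\}) , \qquad (37)$$

and for input noise

$$W(q) = - \int \prod_{k} Dt_k \sum_{\sigma=\pm 1} \mathrm{Tr}_{\{\sigma_k\}} \Delta^\sigma(\{\sigma_k\}) \prod_{k} H(\sigma_k \gamma t_k) \, \ln\left[ \mathrm{Tr}_{\{\sigma_k\}} \Delta^\sigma(\{\sigma_k\}) \prod_{k} H(\sigma_k \gamma t_k) \right] \qquad (38)$$

with

$$\gamma = \sqrt{\frac{q}{1 + \beta^{-1} - q}} . \qquad (39)$$

These formulae will be generalized below for the cases of different noise levels of the
student and of the teacher machine, and for 1-step replica symmetry breaking (RSB1).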
3.2 Identical noise-levels: Results for the case of output noise

3.2.1 Small training set: $q \to 0$

The limit of a 'small' training set is best characterized by the corresponding limit $q \to 0$;
in some cases this implies $\alpha \to 0$, but sometimes the corresponding limiting $\alpha$ can have
a finite value as well (see below). The opposite limit $q \to 1$, which implies $\alpha \to \infty$, will
be considered in the next subsection.

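For $q \to 0$ a redefinition of the correlation moments $a_m^\sigma$ of Eq. (47) in [1] suffices to
capture the influence of the noise considered. With

$$a_m^\sigma := \frac{1}{2^K} \, \mathrm{Tr}_{\{\sigma_k\}} \Delta^\sigma(\{\sigma_k\}) \prod_{k=1}^{m} \sigma_k \qquad (40)$$

one defines

$$b_m^\sigma := \frac{a_m^\sigma + e^{-\beta} a_m^{-\sigma}}{1 + e^{-\beta}} = \begin{cases} a_m^\sigma \tanh(\beta/2) , & \text{for } m \ge 1 , \\ a_0^\sigma \tanh(\beta/2) + \dfrac{1}{1 + e^{\beta}} , & \text{for } m = 0 . \end{cases} \qquad (41)$$

Therefore, the $b_m^\sigma$ fulfill the same algebraic relations, Eq. (51) in [1], as the $a_m^\sigma$, namely

$$b_m^\sigma = -b_m^{-\sigma} \;\; (m \ge 1) \quad \text{and} \quad b_0^{1} + b_0^{-1} = 1 . \qquad (42)$$

In particular, the so-called order index $n$, see below, is unchanged; $n$ is defined by

$$b_n^\sigma \ne 0 , \quad b_m^\sigma = 0 \;\; \text{for } 1 \le m < n . \qquad (43)$$

Moreover, for $W(q)$ one gets to lowest order in $q$

$$W(q) = -b_0^{1} \ln b_0^{1} - b_0^{-1} \ln b_0^{-1} - \ln(1 + e^{-\beta}) - \frac{1}{2} \left(\frac{2}{\pi}\right)^{n} \binom{K}{n} \frac{(b_n^{1})^2}{b_0^{1} b_0^{-1}} \, q^n + \dots =: w_0 - q^n w_1 + \dots , \qquad (44)$$

which agrees completely with Eq. (54) in [1] apart from the replacement of $a_m^\sigma$ by $b_m^\sigma$
and the non-essential additional term $\ln(1 + e^{-\beta})$. So the results for $q(\alpha)$ in [1] can
be simply generalized by these replacements. Only with the generalization error $\varepsilon(\alpha)$ we
have to keep in mind that after training the noise is switched off, such that for $\varepsilon(\alpha)$ the
$a_m^\sigma$ instead of the $b_m^\sigma$ must be kept.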

Concerning the order index $n$, we thus can state as in [1] that

• for $n = 1$, e.g. for the committee machine, the overlap $q(\alpha)$ increases for $q \ll 1$
proportionally to $\alpha$, i.e. there is generalization right from the beginning, and one gets

$$q(\alpha) \to \frac{2 \alpha K}{\pi} \, \frac{(b_1^{1})^2}{b_0^{1} b_0^{-1}} . \qquad (45)$$

This case happens for 6 of the complete set of 9 examples with $K = 4$ hidden units
given in Fig. 1 of [1].

• For $n = 2$, phase transitions of second order or of first order (or both) are possible,
as discussed in detail in section 7.2 of [1]. Generally, for $n = 2$ the network is
purely guessing, i.e. the error probability is 1/2, as long as $\alpha$ is between 0 and the
first critical value, whereas an increase of $\alpha$ beyond this critical value leads to a
continuous (resp. discontinuous) increase of the generalization ability in the case of a
2nd-order (resp. 1st-order) transition. These transitions with $q(\alpha) = 0$ for $\alpha < \alpha_c$
are called by us 'Aha-effect transitions'. As just stated, they appear only for $n \ge 2$,
whereas for $n = 1$ only so-called 'interim transitions' (if at all) can happen: At an
'interim transition' $q(\alpha)$ is finite already below $\alpha_c$.

If for $n = 2$ the 2nd-order transition is not preceded by a 1st-order one, the critical
loading is

$$\alpha_c = \frac{\pi^2 \, b_0^{1} b_0^{-1}}{4 K (K-1) (b_2^{1})^2} . \qquad (46)$$

The case $n = 2$ happens two times in Fig. 1 of [1].

• For $n \ge 3$, as in the noise-free case, one always gets a first-order transition at
a critical $\alpha_c > 0$. Nevertheless this $\alpha_c$ has to be obtained numerically since the
behaviour around a $q = q_c$ with $q_c > 0$ is relevant.

This case happens e.g. for the parity machine with $K \ge 3$, which has $n = K$.

Thus, as long as there is no transition of first order, the behaviour can be described
analytically by looking at how the noise strength changes the correlation moments $b_m^\sigma$.
Some of the following statements are based on this fact; they are not exact as far as
the locations of first-order transitions are concerned, but we do not always state this
limitation explicitly.

If one considers only machines which have the same probability for the two possible
outputs $\sigma = \pm 1$, these results can be simply condensed into a rescaling

$$\alpha \to \alpha_{\rm eff} := \alpha \cdot \tanh^2\left(\frac{\beta}{2}\right) = \alpha \cdot (1 - 2 p_f)^2 , \qquad (47)$$

where $p_f$ is the flip probability defined in Eq. (27). This can be intuitively understood
as follows:

• $p \cdot (1 - 2 p_f)$ is the "uncorrupted fraction" of the training set.

• The results are affected by noise from both the teacher and the student machine,
which explains the power of 2 in Eq. (47).
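
The rescaling can be checked numerically for the simplest case, the single-layer perceptron ($K = 1$) with output noise. The sketch below is not the authors' procedure: it assumes the replica-symmetric free energy in the form $\frac12\ln(1-q) + \frac{q}{2} - \alpha W(q)$, with $W(q) = -(1+e^{-\beta})^{-1} \int Dt \sum_\sigma A_\sigma \ln A_\sigma$ and $A_\sigma = H(\sigma\gamma t) + e^{-\beta} H(-\sigma\gamma t)$, $\gamma = \sqrt{q/(1-q)}$, as one reading of the energy term for $K = 1$. The function names, the trapezoidal quadrature and the bisection on the numerical slope are illustrative choices; the bisection presumes a single extremum, which is adequate for $K = 1$ where no first-order transition occurs.

```python
import math

def H(x):
    """Gaussian tail integral H(x) = int_x^inf Dt."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def W_output_noise(q, beta, n_points=1201, t_max=8.0):
    """Energy term for K = 1 with output noise (trapezoidal rule in t)."""
    gamma = math.sqrt(q / (1.0 - q))
    eb = math.exp(-beta)
    h = 2.0 * t_max / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        t = -t_max + i * h
        weight = h * (0.5 if i in (0, n_points - 1) else 1.0)
        gauss = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        for sigma in (1, -1):
            a = H(sigma * gamma * t) + eb * H(-sigma * gamma * t)
            total += weight * gauss * a * math.log(a)
    return -total / (1.0 + eb)

def free_energy(q, alpha, beta):
    return 0.5 * math.log(1.0 - q) + 0.5 * q - alpha * W_output_noise(q, beta)

def overlap(alpha, beta, lo=1e-4, hi=0.999, step=1e-6):
    """Solve dF/dq = 0 by bisection on the central-difference slope."""
    def slope(q):
        return (free_energy(q + step, alpha, beta)
                - free_energy(q - step, alpha, beta)) / (2.0 * step)
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For small loading the result should approach the lowest-order prediction $q \approx (2/\pi)\,\alpha \tanh^2(\beta/2)$, i.e. the noise-free slope $2\alpha/\pi$ evaluated at the rescaled loading $\alpha_{\rm eff}$.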



3.2.2 Identical noise-levels; output noise; large training set: $q \to 1$

In this limit, the teacher network is approximated with arbitrary accuracy, for every noise
strength $\beta^{-1}$. That this is possible is not at all self-evident: The training set contains
mistakes since the teacher makes errors, the student learns this set making errors as well,
but in spite of these facts the teacher machine is approximated perfectly. As we will see
below, a bad training strategy could impede generalization; so the fact that the student
accepts the noise level of the teacher, as assumed at present, is already a good strategy,
although it is not yet optimal, as we will see later.
From Eq. (36) one obtains for $q \to 1$ the information gain

$$W(q) \to p_f \left[ \ln(1 - p_f) - \ln p_f \right] + n_c \, w_1(\beta) \sqrt{1 - q} \qquad (48)$$



with 




1 + e^ 



if (u) In 



1 + e-P 



In [e^if (u) + if (-«)] + 



H(u) + e-PH(-u) 
H(-u 
1 



) + e-PH{u) 

-In [e^ff^) + H{-u)\ 



.(49) 



Determining $q(\alpha)$ from $W(q)$ and inserting the result again into Eq. (32), one obtains

$$\varepsilon(\alpha) \to \frac{\sqrt{2}}{w_1(\beta) \, \pi \, \alpha} . \qquad (50)$$

So again, as in the deterministic case, one has a $1/\alpha$-asymptotics, and the prefactor does
not depend on the Boolean function $B(\{\sigma_k\})$, but only on the noise level $\beta$. Therefore
again, one is led to a rescaling

$$\alpha \to \alpha_{\rm eff} := r(\beta) \cdot \alpha = \frac{w_1(\beta)}{w_1(\infty)} \cdot \alpha = \frac{w_1(\beta)}{0.720647} \cdot \alpha . \qquad (51)$$



In Fig. 2, the scaling parameter $r(\beta) = w_1(\beta)/0.720647$, which applies to the regime
$q \to 1$, and the "intuitive" scaling parameter $r_{\rm it}(\beta) := [1 - 2 p_f(\beta)]^2$, which applies to the
limit $q \to 0$, are presented as a function of the "flip parameter" $2 p_f$, which corresponds to
the "corrupted fraction of the training set". Obviously $r(\beta) < r_{\rm it}(\beta)$, which means that
for large loading the effect of noise is stronger than for low loading. But the difference is
not large, so that the "intuitive reduction factor" $r_{\rm it}(\beta)$ always gives a good estimate
of the effect. It should be noted, however, that in the limit $q \to 1$ already a slight corruption
of the training set leads to a significant reduction of the generalization ability because
of the infinite slope of $r(2 p_f)$ in the limit $p_f \to 0$.
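
The numbers quoted above fit together as follows. The sketch assumes that the large-$\alpha$ Gibbs error reads $\sqrt{2}/(w_1 \pi \alpha)$ and takes the noise-free value $w_1 = 0.720647$ from the text; the constant and function names are hypothetical.

```python
import math

W1_NOISE_FREE = 0.720647  # noise-free value of w_1 quoted in the rescaling above

def asymptotic_gibbs_error(alpha, w1=W1_NOISE_FREE):
    """Large-alpha generalization error, assuming epsilon = sqrt(2)/(w1*pi*alpha)."""
    return math.sqrt(2.0) / (w1 * math.pi * alpha)

def intuitive_factor(p_flip):
    """'Intuitive' reduction factor (1 - 2 p_f)**2 that applies for q -> 0."""
    return (1.0 - 2.0 * p_flip) ** 2
```

With the noise-free $w_1$ the prefactor evaluates to about 0.625, i.e. the universal noise-free asymptotics; a smaller $w_1(\beta)$ under noise enlarges the prefactor by the factor $1/r(\beta)$.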



3.2.3 Output noise: Data collapsing in the large-noise limit

Strong noise is described by $\beta \to 0$. Calculating the behaviour in this limit for both
$q \to 0$ and $q \to 1$, one sees that $q(\alpha, \beta)$ is $\propto \beta^2 \alpha$. We have looked for (approximate)
data collapsing in the whole parameter region $0.2 \le \beta \le 1$, by plotting the curves $q(\alpha, \beta)$ not
only as a function of $\alpha$, with $\beta$ as curve parameter, but also using the product
$\beta^2 \alpha$ as scaling variable. The results are compared in Fig. 3 and show that with the
variable $\beta^2 \alpha$, for $\beta \le 1$, a good, although still approximate, data collapsing is obtained
in the whole region $0 < \alpha \beta^2 < \infty$. This data collapsing becomes asymptotically exact
in the above-mentioned limits $q \to 0$ and $q \to 1$.



3.3 Identical noise-levels: Results for the case of input noise 
3.3.1 Small training set: $q \to 0$

Again we consider at first the limit $q \to 0$. In this case one gets from Eq. (39)

$$\gamma^2 \to \frac{q}{1 + \beta^{-1}} =: \zeta \, q , \qquad (52)$$

and for the free energy

$$\mathcal{F} \to \mathrm{extr}_{(q)} \left\{ -\frac{q^2}{4} - \alpha \cdot \left( w_0 - \zeta^n q^n w_1 \right) \right\} , \qquad (53)$$

where $w_0$ and $w_1$ are defined as

$$w_0 = -a_0^{1} \ln a_0^{1} - a_0^{-1} \ln a_0^{-1} , \qquad w_1 = \frac{1}{2} \left(\frac{2}{\pi}\right)^{n} \binom{K}{n} \frac{(a_n^{1})^2}{a_0^{1} a_0^{-1}} . \qquad (54)$$

Again, the order index $n$, and thus the qualitative behaviour, remains unchanged by the
noise. Concerning the three cases of $n$, we have now:

• For $n = 1$, the overlap $q(\alpha \to 0)$ is given by

$$q(\alpha) \to 2 \alpha \zeta w_1 . \qquad (55)$$

Since $\zeta < 1$, the noise diminishes the overlap.

• For $n = 2$, the critical value $\alpha_c$ of the 2nd-order phase transition (if it is not
preceded by a 1st-order one, see above) shifts to a higher value:

$$\alpha_c(\zeta) = \zeta^{-2} \, \alpha_c(0) , \qquad (56)$$

where $\alpha_c(0)$ denotes the noise-free value.

• For $n \ge 3$, there is again a 1st-order phase transition, but the resulting critical
loading must be determined numerically, since the behaviour around a $q = q_c$ with
$q_c > 0$ is relevant.

Taking all three cases together, we find a rescaling, which also applies for $n \ge 3$, namely

$$\alpha \to \alpha_{\rm eff} = \zeta^n \cdot \alpha . \qquad (57)$$

We recall the caution which has to be taken as far as first-order transitions
are concerned.

3.3.2 Identical noise-levels; input noise; large training sets: $q \to 1$

Considering the limit $q \to 1$, i.e. $\alpha \to \infty$, one sees from Eq. (39) that now $\gamma$ does not
converge to $\infty$, in contrast to the behaviour without noise, or with output noise. Instead

$$\gamma \to \sqrt{\beta} . \qquad (58)$$

Expanding Eq. (38) for $1 - q \to 0$, one obtains

$$W(q) \to w_0(\beta) + (1 - q) \cdot w_1(\beta) . \qquad (59)$$

If one determines $q(\alpha)$ from Eq. (59) and inserts it into Eq. (32), one obtains

$$\varepsilon(\alpha) \to \frac{n_c}{\pi} \, \frac{1}{\sqrt{w_1(\beta) \, \alpha}} \qquad (60)$$



with 



MP) 



2V2^ 



7 it CT =±i {<Tfc> 



-^ /2 n 

fc(^m) 



x In 



Tr J] H(a k t k J(3) 



{<>•*} 



fc(^m) 



(61) 



From Eq. (60) one can see that input noise, in contrast to output noise, leads to a drastic
deterioration of the generalization ability, namely from an asymptotics $\varepsilon(\alpha) \to c/\alpha$
to the slower decrease $\varepsilon(\alpha) \to \tilde{c}/\sqrt{\alpha}$. Additionally we find that the prefactor $\tilde{c}$ of this
behaviour, in contrast to $c$, depends on the architecture. In Fig. 4, for the $K = 2$
and $K = 3$ parity machines and for the $K = 3$ committee, we present this prefactor $\tilde{c}$
of the asymptotic behaviour $\varepsilon(\alpha \to \infty) \to \tilde{c}/\sqrt{\alpha}$ as a function (i) of $\gamma$ and (ii) of the
"flip probability" $p_f = n_c \pi^{-1} \sqrt{\gamma}$. Interestingly, with the last-mentioned representation,
the data almost collapse to a single curve, although the results look quite different when
presented against $\gamma$.

For the simple perceptron a corresponding result is given in M. 



3.3.3 Input noise: Data collapsing in the large-noise limit 

For the case of input noise it can further be shown that in the high-temperature limit
$\beta \to 0$ and for $\alpha \to \infty$

e(a)^n c y^-=±=, (62) 
7r \/VoP n a 



with 



Tr/ ^ aft Vn 



The corresponding behaviour for $\alpha \to 0$ can be seen from (57). In summary, data
collapsing occurs in the limit $\beta \to 0$ both for $q \to 0$ and $q \to 1$, but with the $n$-dependent rescaling
$\alpha \to \alpha_{\rm eff} = \beta^n \alpha$.

In Fig. 5, for two examples, we check whether for the input noise temperatures
$\gamma = 1/\beta = 1.0, 2.0, \dots, 5.0$ one gets data collapsing. Here we use $\zeta = 1/(1 + \gamma)$ rather
than $\beta$ for rescaling, which is a better choice if $\beta$ is not close to 0. We compare results
for $q$ plotted as a function of $\alpha$ with results where $q$ is plotted against $\alpha \zeta^2$ for the
$K = 2$ parity machine, resp. against $\alpha \zeta$ for the K4_mcode2 machine.
3.4 Identical noise-levels: The different impact of output- and input noise

As already seen, there are some significant differences between the impact of output
noise and that of input noise. Let us first have a closer look at the influence of a small
disturbance by noise.



3.4.1 Small-noise limit: $\beta \to \infty$

To compare the impacts of noise in this limit we have to distinguish between the cases
of small and large reduced size $\alpha = p/N$ of the training set, respectively. For small $\alpha$
the effect of input noise is, for $\gamma \to 0$ (or $\beta \to \infty$), according to Eq. (57)

$$\alpha_{\rm eff} = \zeta^n \cdot \alpha = (1 + \gamma)^{-n} \cdot \alpha \to (1 - n \gamma) \, \alpha . \qquad (64)$$

With the flip rate $p_f$ given by Eq. (33), in this limit we have for input noise

$$\alpha_{\rm eff} = \alpha \cdot \left[ 1 - n \left( \frac{\pi p_f}{n_c} \right)^2 \right] . \qquad (65)$$

The training set is therefore only reduced by a small amount $\propto p_f^2$. In contrast, for
output noise this amount is $\propto p_f$. For machines with equal probability for the final output
$\sigma = \pm 1$ this can directly be seen from Eq. (47). This means that a small amount of
input noise hardly matters for the case of $\alpha \to 0$, in contrast to a small amount of
output noise, which - so to say - instantly deteriorates the behaviour.

On the other hand, for $\alpha \to \infty$ a small amount of input noise induces a qualitative
change in the asymptotics, since then the behaviour is shifted from the $1/\alpha$-asymptotics
to the slower $1/\sqrt{\alpha}$-decrease, if $\alpha$ is beyond the corresponding (non-universal) crossover
value. In contrast, for output noise the $1/\alpha$-behaviour is qualitatively unchanged; only
the prefactor is increased.



3.4.2 Input noise: Disappearance of phase transitions 

A further difference concerns the phase transition of the K4_mcode2 machine: Here the
intermediate phase transition, which exists in the deterministic case (see [1]), disappears
for the case of input noise. This does not happen for output noise. Whether other
intermediate phase transitions are affected similarly remains to be checked.

Nevertheless all other types of phase transitions (see subsection 3.2.1 above or [1])
persist for both kinds of noise, although the shift of the critical parameters depends,
of course, both on the strength and on the type of the noise.



4 Optimization of the noise-level of the student machine 

In this section we focus exclusively on output noise, but now different noise-levels for
student and teacher networks are allowed. The following two sections 4.1 and 4.2 give the
theory for the general multilayer case. The formulas are evaluated mainly for the simple
perceptron; for the general case only asymptotic results within the replica symmetric
formalism are given.

4.1 Different noise-levels; output noise: Replica-symmetric formalism



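In the replica-symmetric formalism of the preceding section one now has two different
noise strengths $\beta_t$ and $\beta_s$ of the teacher resp. the student machine, and additionally now
$q \neq r$. Therefore, instead of Eqs. (35) and (36) one has

$$\mathcal{F}(q,r) = \frac{1}{2}\left[\ln(1-q) + \frac{q - r^2}{1-q}\right] + \alpha \cdot W(q,r) , \qquad (66)$$

where for the present case of output noise it is

$$W(q,r) = \frac{1}{1+\exp(-\beta_t)} \int \prod_k Dt_k \sum_{\sigma=\pm 1} \Big[\mathop{\rm Tr}_{\{\sigma_k\}} \Delta^{\sigma}(\{\sigma_k\}) \prod_k H(\sigma_k \gamma_r t_k)\Big] \, \ln\Big[\mathop{\rm Tr}_{\{\sigma_k\}} \Delta^{\sigma}(\{\sigma_k\}) \prod_k H(\sigma_k \gamma t_k)\Big] , \qquad (67)$$

with

$$\gamma = \sqrt{\frac{q}{1-q}} \quad \text{and} \quad \gamma_r = \frac{r}{\sqrt{q - r^2}} , \qquad (68)$$

and with $\Delta$ defined by Eq. (37) with $\beta = \beta_t$ and $\beta = \beta_s$, respectively. The functions $q(\alpha)$
and $r(\alpha)$ follow again from the saddle-point conditions (23). An additional quantity of
interest is the relative training error $\epsilon_{tr} := E_{tr}/p = -\alpha^{-1}\, \partial\mathcal{F}/\partial\beta_s$; $\epsilon_{tr}$ is thus the fraction
of the training set which is misclassified by the student machine. A straightforward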
calculation, see ||, yields 

1 , k Tr A a ({a k }) H k H(a k ^ r t k ) 

l + e^y fc=1 CT=±1 Tr A CT ({a fc })rifc^(^fc74) {CT *> * 

(69) 



4.2 Different noise-levels; output noise: Replica symmetry breaking 

Within the usual 1-step replica symmetry breaking (RSB1) scheme, see 0, one gets with 
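$q_{a,b} = 1$ for $a = b$, $q_{a,b} = q_1$ if $a$ and $b$ are different but belong to the same subgroup of
$m$ of the $n$ replicas, and $q_{a,b} = q_0$ else, and with $\Delta q := q_1 - q_0$:

$$\mathcal{F}(r,q_0,q_1,m) = \frac{q_0 - r^2}{2(1 - q_1 + m\Delta q)} + \frac{1}{2}\ln(1-q_1) + \frac{1}{2m}\ln\left(1 + \frac{m\Delta q}{1-q_1}\right) + \frac{\alpha}{m}\, W(r,q_0,q_1,m) \qquad (70)$$

with

$$W(r,q_0,q_1,m) = \frac{1}{1+\exp(-\beta_t)} \int \prod_k Dt_k \sum_{\sigma=\pm 1} \Big[\mathop{\rm Tr}_{\{\sigma_k\}} \Delta^{\sigma}(\{\sigma_k\}) \prod_k H\Big(\sigma_k \frac{r}{\sqrt{q_0 - r^2}}\, t_k\Big)\Big] \times \ln \int \prod_k Dv_k \Big[\mathop{\rm Tr}_{\{\sigma_k\}} \Delta^{\sigma}(\{\sigma_k\}) \prod_k H\Big(\sigma_k \frac{t_k\sqrt{q_0} + v_k\sqrt{\Delta q}}{\sqrt{1-q_1}}\Big)\Big]^m . \qquad (71)$$
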
4.3 RS and RSB results for the simple perceptron

In the following we use the notation 'perfect student' or 'perfect learning' to describe the
fact that the given training set TS is learned without any error, so 'TS-perfect' would be
a more precise terminology. 'Non-perfect learning' (or better 'TS-nonperfect learning')
means that errors with respect to TS are made. We stress here that TS may
already be corrupted with respect to the original rule. (We also mention that 'perfect
learning' is sometimes used as well to describe coinciding architectures of student and
teacher, which is the case here anyway. To keep our special definition in mind, we always
put the term 'perfect' in quotation marks.)



4.3.1 RS results: 'Non-perfect' teacher, but 'perfect' student 

In the following, for simple perceptrons ($K = 1$) we consider at first the case of a 'perfect'
student machine (i.e. necessarily $\beta_s = \infty$) but allow for a 'non-perfect' teacher ($\beta_t < \infty$),
which means that the training set itself is partially corrupted, since the answers given
by the teacher on the questions $\xi^\mu$ ($\mu = 1, \ldots, p = \alpha N$) do not always follow the rule,
but are partially random.

Of course, 'perfect learning' of the corrupted training set is then possible only up to a
specific $\alpha_c$ depending on the noise; e.g. if for all input patterns $\xi^\mu$ the outputs prescribed
by TS were randomized with respect to the original rule, then one would get the
famous result a c = 2 of E. Gardner, 0. 

In Fig. 6a, both for the case of output noise and for input noise, $\alpha_c$ is presented
as a function of the non-corrupted fraction $(1 - 2p_f)$ of the training set, with $p_f$ taken
from Eqs. ( p7| ) and fl29|), respectively. Additionally, in Fig. 6b, for output noise with 
$T_t = 1$, the curve $r(\alpha)$ for Maximal Stability Learning (MSL) is shown (the solid line)
and compared with the corresponding result for Gibbs learning (the dashed line). For
the MSL case, the student is not a random member of the version space but has those
couplings which lead to maximal stability of the classification of the whole training set
of patterns mapped to $+1$ and $-1$, respectively. This specific member can be obtained
by the well-known AdaTron algorithm, [10].



Two overfitting effects can be seen: Considering Gibbs learning, the overlap decreases
at the end of the curve. Compared to MSL, Gibbs learning is worse for small $\alpha$ (which
is expected) but becomes better for $\alpha \to \alpha_c$, although MSL chooses a specific vector as
student which is supposed to perform the classification task very well.

These are hints that training with noise might avoid overfitting effects of the student
and therefore could lead to an enhanced performance.



4.3.2 RS results: 'Non-perfect' teacher and 'non-perfect' student 

Here we consider again the perceptron ($K = 1$) and assume a given output noise-level
$\beta_t = 1$ of the teacher machine.

In Fig. 7 we present the dependence of various quantities on the noise strength
$T_s := 1/\beta_s$ of the student machine. These are

• The overlap $r(T_s)$ of the two coupling sets. This quantity determines the
generalization ability of the Gibbs training algorithm.

• The typical overlap $q(T_s)$ of two student perceptrons.

• The overlap $r_{cp}$ of the so-called "central-point network" with the teacher machine:
The coupling vector of the "central-point network" is obtained as a (weighted)
average of the coupling vectors of all student perceptrons forming the Gibbs ensemble.
Analogously to Eqs. (75) and (76) in one can show that $r_{cp}(T_s) = r(T_s)/\sqrt{q(T_s)}$.
The corresponding generalization error is obtained by plugging $r_{cp}$ into (25).

Moreover, for $\beta_s = \beta_t$ the corresponding network turns out to have the same
generalization ability as an exploitation of the version space performed by the Bayes
algorithm, see [1].

From the results of Fig. 7, the following points should be noted : 

1. For all values of $\alpha$, $r(T_s)$ and $q(T_s)$ cross at $T_s = T_t$. In fact, for this case the
teacher has the same properties as a typical student of the Gibbs ensemble.

2. The curves $r_{cp}(T_s)$ have a flat maximum at $T_s = T_t$. This is also to be expected,
since in this case the central-point network reaches the generalization ability of the Bayes
algorithm, which is maximal according to information theory. Nevertheless, since the
curve $r_{cp}(T_s)$ is very flat around the maximum, the precise value $T_s = T_t$ is not
essential.

3. In contrast, the overlap $r(T_s)$ for Gibbsian learning shows a maximum at a finite
noise level $T_s$ only for $\alpha = 2$ and $\alpha = 5$, but not for $\alpha = 1$. Evidently, for Gibbs learning,
training with output noise (i.e. $T_s > 0$) is only advantageous beyond a finite $\alpha$, which of
course depends on $T_t$. For a similar model this was reported already in [11].

Fig. 8 presents the optimal value of the student noise-level $T_s$ as a function of $\alpha$
for fixed $T_t = 1$. Beyond $\alpha = \alpha_c$ ($\approx 0.6$ in Fig. 8) training with noise leads to better
generalization. We determined the value of $\alpha_c$ only numerically from the appearance of
a maximum. For large $\alpha$, the optimal $T_s$ for $T_t = 1$ converges to 0.60524.

In Fig. 9, the training error $\epsilon_{tr}$ is presented as a function of $T_s$ for $T_t = 1$ and for
different $\alpha$-values, ranging from $\alpha = 0$ (lowest curve for small $T_s$) to $\alpha \to \infty$. For
$T_s \to \infty$ all curves converge to $\epsilon_{tr} = 0.5$, as they should. For $\alpha \to 0$ the result is only
determined by $\beta_s$, and for $\alpha \to \infty$ only by $\beta_t$, namely

$$\epsilon_{tr}(T_s)\big|_{\alpha \to 0} = \frac{e^{-\beta_s}}{1 + e^{-\beta_s}} , \qquad \epsilon_{tr}(T_s)\big|_{\alpha \to \infty} = \frac{e^{-\beta_t}}{1 + e^{-\beta_t}} . \qquad (72)$$
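
As a quick numerical illustration of the two limits in Eq. (72), read here as $e^{-\beta_s}/(1+e^{-\beta_s})$ for $\alpha \to 0$ and $e^{-\beta_t}/(1+e^{-\beta_t})$ for $\alpha \to \infty$, the following minimal Python sketch evaluates this flip rate with $\beta = 1/T$. The function names are ours and purely illustrative; nothing beyond the form of the two limits is taken from the text.

```python
import math


def flip_rate(temperature):
    """Return e^{-beta} / (1 + e^{-beta}) with beta = 1/T (0 for T <= 0)."""
    if temperature <= 0.0:
        return 0.0
    beta = 1.0 / temperature
    if beta > 700.0:  # avoid overflow in exp; the rate is numerically zero here
        return 0.0
    return 1.0 / (1.0 + math.exp(beta))


def eps_tr_small_alpha(t_student):
    """Training error for alpha -> 0: determined by the student noise only."""
    return flip_rate(t_student)


def eps_tr_large_alpha(t_teacher):
    """Training error for alpha -> infinity: determined by the teacher noise only."""
    return flip_rate(t_teacher)


if __name__ == "__main__":
    for t_s in (0.1, 0.35, 1.0, 5.0):
        print(t_s, eps_tr_small_alpha(t_s), eps_tr_large_alpha(1.0))
```

For $T_t = 1$ the large-$\alpha$ value is $1/(1+e) \approx 0.269$, independent of $T_s$, while the small-$\alpha$ value rises from 0 towards 0.5 with increasing $T_s$.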



4.3.3 Different noise-levels; simple perceptron: RSB results 

The non-monotonic behaviour of $r(T_s)$ for $\alpha = 2$ in Fig. 7 is not an artefact of replica
symmetry: For $\alpha = 2$ we determined an optimal $T_s > 0$, and since the problem is
learnable for this $\alpha$ and every $T_t$, replica symmetry is correct.

However, when $\alpha$ becomes $> \alpha_c$ (which is always larger than 2 according to Fig.
6a), one expects replica symmetry breaking (RSB). In Fig. 10, for $\alpha = 5$ and $T_t = 1$,
the order parameters $r$ and $q$ are presented as functions of $T_s$, as obtained in RS (the
bold line) and RSB1 theories. Evidently, replica symmetry is correct around the optimal
student temperature $T_s \approx 0.35$ and beyond, but for smaller values ($T_s < 0.27$) replica
symmetry is broken. In fact, in the RS approach, a paper of Györgyi, [11], predicts (for a
similar model) a rather large and useful effect of "training with noise" already for the
case of small $T_s$. However, due to RSB corrections, see Fig. 11, the benefit of noise is weaker
than expected in the RS calculation of [11], since for the case of small noise of the student
machine the overlap $r^{RSB}$ calculated in RSB1 is higher than calculated with RS, and the
value obtained in RSB1 for $T_s < 0.27$ is almost as high as that obtained at the optimal
value $T_s \approx 0.35$, where RS is correct. An extension of replica-symmetry breaking e.g. to
RSB2 (see |J) may even enhance the value of r{T s ) calculated in RSB1. 

At this place we mention that T. Uezu, in a recent preprint, [0, has independently 



treated problems studied in this section 4.3, with largely overlapping results. 



4.3.4 Why RSB occurs: An intuitive explanation 

As stated above, RSB can occur if $\alpha$ becomes $> \alpha_c$ (which is always larger than 2
according to Fig. 6a). Nevertheless RS is restored if the student temperature $T_s$ increases
above a critical point, as demonstrated in the previous paragraph.

To understand this, let us consider the case $T_s \to 0$, which means that only student
machines with minimal training error are allowed. The available phase space will then
separate into disjoint parts: To every disjoint part belong student machines which
misclassify a certain (different) minimal set of patterns of the training examples. The
student machines "in-between the disjoint components" make more errors: If $T_s$ is
increased, they also become "more and more allowed" until finally the allowed region
of phase space merges into a single component, i.e. replica symmetry is restored.
This scenario is presented qualitatively in Fig. 12.



4.4 Different noise-levels; output noise: RS and RSB results for multilayer 
networks 

Since the numerical effort increases drastically for multilayer network we consider exclu- 
sively output noise and restrict ourselves to give just asymptotic results for this case. 
For T t = 0.2, 0.5, 1.0 and 2.0 the ratio T s /T t is varied. 



4.4.1 Large training sets: $q \to 1$

We ask how the student temperature $T_s$ should be chosen in the case of large training
sets. The aim is to obtain an optimal prefactor for the asymptotic behaviour of the
generalization error, $\epsilon(\alpha) \to c_0/\alpha$. This can be calculated analytically for given $T_t$;
results are presented in Fig. 13. There, for 4 different values of $T_t$ we present results for
the ratio $c_0/c_0^{opt}$ of the coefficients $c_0$ of the asymptotic behaviour $\epsilon(\alpha) \to c_0/\alpha$. Here
$c_0^{opt}$ refers to the optimal choice of $T_s$ for given $T_t$. The results have been plotted as a
function of $T_s/T_t$. Again we find that the optimal choice is $T_s/T_t \approx 0.6$, but with a "flat
behaviour". Note that these results do not depend on the architecture, in contrast to
corresponding results for input noise.



4.4.2 Small training sets: $q \to 0$

The crucial question in this limit is in which way the above-mentioned phase transitions
shift when a noise strength $T_s \neq T_t$ is used, in particular whether there is an optimal
$T_s^{opt} \neq T_t$ for which the "Aha-effect" happens earlier, i.e. for smaller $\alpha$.

For the parity machine this question can be answered at once: There the choice
$T_s = T_t$ means that the already mentioned "central-point network" with overlap $r/\sqrt{q}$
reaches the Bayes generalization ability, which is optimal, i.e. $\epsilon_{Bayes}(q) = \epsilon_{Gibbs}(\sqrt{q})$.






Therefore, for the parity machine the " Aha-effect" phase transition cannot happen earlier 
than calculated for T s = T t . 

For other networks, a slight addition to the argument is in order, since now the
generalization ability of the "central-point network" is smaller than that obtained with
the Bayes prescription. But for $T_s = T_t$ the typical student is also a typical teacher, i.e.
we have an ensemble of student coupling vectors which corresponds to the a-posteriori
probability that a certain coupling vector is that of the teacher machine. The Bayesian
generalization error can be calculated from the expectation value of $q$ for this ensemble
by means of (26). This implies that the Bayes generalization ability becomes trivial
(i.e. the error probability is 1/2) below the critical $\alpha$ calculated for $T_s = T_t$. Since the
Bayes generalization is optimal, a variation of $T_s$ cannot lead to further improvement.
So a transition to nonzero $q$ cannot occur earlier for whatever $T_s$ one chooses, and thus
for all networks considered the phase transition cannot appear for smaller $\alpha$.



5 Why is training with noise useful ? 



That training with noise can be useful is already 'folklore', see e.g. [12] or [13], and
in the present context it is also known that in this way one can avoid overfitting (see
13, [3]). Here we want to go somewhat more into the details and look into the phase
space structure.



5.1 Survey of the phase space of a small system 

Let us first define the so-called 'genuine error' $E_0$ of the student: This quantity counts
the number of patterns from the training set where the (deterministic) answer of the
student disagrees with the original rule, and not with the partially corrupted answer of
the teacher network.

Now we perform a simple survey of the phase space of a small perceptron with $K = 1$
and $N = 10$, with a normalized, randomly chosen coupling vector $w^t$ with $N$ components,
and with a given set of $p = 20$ questions, i.e. we have $\alpha = p/N = 2$. By flipping
5 of the 20 answers given by $w^t$, we have our teacher perceptron endowed with an
output noise level of $p_f = 5/20 = 0.25$ and with a training set consisting of the 20
pairs of questions and the partially corrupted answers. Then a large number of student
vectors $w^s$ are drawn randomly from a uniform prior over all normalized real vectors
with $N = 10$, where the random components are sampled from a Gaussian distribution
with zero average and variance $1/N$. Finally, for each $w^s$, we evaluate (i) the error $E_{tr}$
with respect to the actual training set, i.e. the corrupted one, (ii) the error $E_0$ with
respect to the uncorrupted answers on the training patterns, and (iii) the actual overlap
$r$ with $w^t$. The results are condensed into a two-parameter table: For each combination
of the values of $E_{tr}$ and $E_0$ the number of vectors (yielding these errors) as well as the
corresponding averaged overlap $r$ is stored.

This table can be used to calculate the $r(T_s)$ curve for the specific values of $\alpha = 2$
and $p_f = 0.25$. This curve is shown in Fig. 14 (the solid line) and compared to the
theoretical result with $\alpha = 2$ and $T_t = 0.91$ (corresponding to $p_f = 0.25$), which has been
evaluated analogously to Fig. 7. We see that, in spite of the smallness of the simulated
system, there is actually a nice qualitative similarity (more would be unexpected) in
the behaviour of $r$ as a function of $T_s$; in particular we find that it is useful to increase
the noise until $T_s \approx 0.4$, whereas for larger $T_s$ noise is more and more detrimental.
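
The survey described above can be sketched in a few lines of Python (standard library only). This is a minimal illustration under stated assumptions, not the original simulation code: the function names, the random seed and the number of students are ours, and we assume that the $r(T_s)$ curve is obtained from the table by weighting every sampled student with the Gibbs factor $\exp(-E_{tr}/T_s)$. The helper `teacher_temperature` inverts the output-noise flip rate $p_f = e^{-1/T}/(1+e^{-1/T})$.

```python
import math
import random
from collections import defaultdict


def teacher_temperature(p_f):
    """Invert p_f = e^{-1/T} / (1 + e^{-1/T}) for the noise temperature T."""
    return 1.0 / math.log((1.0 - p_f) / p_f)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(x):
    return 1 if x >= 0 else -1


def survey(n=10, p=20, n_flips=5, n_students=20000, seed=0):
    """Tabulate (E_tr, E_0) -> [count, summed overlap r] for random students."""
    rng = random.Random(seed)
    teacher = [rng.gauss(0.0, 1.0) for _ in range(n)]
    t_norm = math.sqrt(dot(teacher, teacher))
    # Binary questions with components +-1, as for the input units of the model.
    patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
    clean = [sign(dot(teacher, xi)) for xi in patterns]
    flipped = set(rng.sample(range(p), n_flips))
    noisy = [-s if mu in flipped else s for mu, s in enumerate(clean)]
    table = defaultdict(lambda: [0, 0.0])
    for _ in range(n_students):
        # Gaussian components give a uniform direction; only the direction matters.
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        out = [sign(dot(w, xi)) for xi in patterns]
        e_tr = sum(o != s for o, s in zip(out, noisy))  # error w.r.t. corrupted set
        e_0 = sum(o != s for o, s in zip(out, clean))   # genuine error
        r = dot(w, teacher) / (math.sqrt(dot(w, w)) * t_norm)
        cell = table[(e_tr, e_0)]
        cell[0] += 1
        cell[1] += r
    return table


def overlap_curve(table, t_student):
    """Gibbs-weighted mean overlap r(T_s) for T_s > 0 (assumed weighting)."""
    e_min = min(e_tr for e_tr, _ in table)
    num = den = 0.0
    for (e_tr, _), (count, r_sum) in table.items():
        weight = math.exp(-(e_tr - e_min) / t_student)
        num += weight * r_sum
        den += weight * count
    return num / den


if __name__ == "__main__":
    tab = survey()
    print("T_t for p_f = 0.25:", teacher_temperature(0.25))
    for t_s in (0.1, 0.2, 0.4, 0.8, 1.6):
        print(t_s, overlap_curve(tab, t_s))
```

Since low values of $E_{tr}$ are rare under uniform sampling, the curve at small $T_s$ rests on few students and is correspondingly noisy; a larger `n_students` reduces this at the cost of run time.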

Looking more directly at the performance of this small system, we note the following
numbers from a typical realization of the survey through its phase space: In this survey
the combination ($E_0 = 5$, $E_{tr} = 0$) was realized 215 times: These 215 students made
no error with respect to the actual (i.e. corrupted) training set, and thus they made
$E_0 = 5$ errors with respect to the original rule. Possible values for $E_{tr} = 1$ are then (i)
$E_0 = 4$, i.e. one of the 5 errors of the student with respect to the original rule has been
corrected, or (ii) $E_0 = 6$, i.e. one additional error has been made. In the simulation we
found that (i) happened in 1746 cases, whereas (ii) was less frequent, namely 1146 times,
although there are many more combinatorial possibilities for case (ii). The corresponding
averaged overlaps are of course increased for the case $E_0 = 4$ and decreased for $E_0 = 6$.
So evidently the system is able to correct errors previously made with respect to the
original rule, and to increase the typical overlap $r$ in this way.
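
The combinatorial remark can be made explicit with a few lines of arithmetic. With 5 flipped and 15 unflipped answers, a student with $E_{tr} = 1$ either disagrees with one of the 5 corrupted answers (case (i)) or with one of the 15 uncorrupted ones (case (ii)). The sketch below, whose variable names are ours, divides the observed counts quoted above by the number of such choices.

```python
from math import comb

n_flipped, n_clean = 5, 15           # corrupted / uncorrupted answers in the training set
ways_correct = comb(n_flipped, 1)    # case (i): which corrupted answer is "repaired"
ways_worsen = comb(n_clean, 1)       # case (ii): which correct answer is spoiled

count_correct, count_worsen = 1746, 1146   # counts quoted in the text

per_choice_correct = count_correct / ways_correct
per_choice_worsen = count_worsen / ways_worsen

print(ways_correct, ways_worsen)
print(per_choice_correct, per_choice_worsen, per_choice_correct / per_choice_worsen)
```

Per combinatorial choice, error-correcting students are thus roughly 4.6 times more frequent in this realization than students with an additional error, although the latter have three times as many choices.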



5.2 The error $\epsilon_0$ with respect to the uncorrupted training set

This can be confirmed theoretically as well. Defining $\epsilon_0 = E_0/p$ as the fraction of errors of
the student with respect to the original training set, one derives the following equation:

$$\epsilon_0 = \frac{2}{1 + e^{-\beta_t}} \int Dt\; \frac{H(\gamma_r t) H(-\gamma t)\, e^{-\beta_s} + H(-\gamma_r t) H(\gamma t)\, e^{-\beta_t}}{H(\gamma t) + e^{-\beta_s} H(-\gamma t)} , \qquad (73)$$

with $\gamma$ and $\gamma_r$ defined in (68). In Fig. 15, $\epsilon_0$ is plotted as a function of $T_s$ for $\alpha = 2$ and
$T_t = 1$. So with increasing $T_s$, $\epsilon_0$ decreases at first, which means that at first the system
mainly corrects the mistakes contained in the corrupted training set with respect to the
original rule. But after a minimum, $\epsilon_0$ increases again for larger $T_s$ and approaches the
'training error' $\epsilon_{tr}$ (i.e. with respect to the corrupted training set) for $T_s \to \infty$.
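
For readers who want to evaluate the genuine error numerically, the following Python sketch integrates the expression of Eq. (73) as we read it, namely $\epsilon_0 = \frac{2}{1+e^{-\beta_t}} \int Dt\, \frac{H(\gamma_r t)H(-\gamma t)e^{-\beta_s} + H(-\gamma_r t)H(\gamma t)e^{-\beta_t}}{H(\gamma t) + e^{-\beta_s}H(-\gamma t)}$, with $H(x) = \frac{1}{2}{\rm erfc}(x/\sqrt{2})$. Several points are assumptions on our side: the forms $\gamma = \sqrt{q/(1-q)}$ and $\gamma_r = r/\sqrt{q-r^2}$, the simple trapezoidal rule, and all function names. The order parameters $q$ and $r$ must be supplied by the user (with $0 \le r^2 < q < 1$, or $q = r = 0$); the saddle-point equations are not solved here.

```python
import math


def H(x):
    """Gaussian tail integral H(x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def genuine_error(q, r, t_student, t_teacher, n_grid=4001, t_max=8.0):
    """Trapezoidal estimate of eps_0 for given overlaps q, r and temperatures."""
    gamma = math.sqrt(q / (1.0 - q))
    gamma_r = r / math.sqrt(q - r * r) if r != 0.0 else 0.0
    w_s = math.exp(-1.0 / t_student)  # e^{-beta_s}
    w_t = math.exp(-1.0 / t_teacher)  # e^{-beta_t}
    h = 2.0 * t_max / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        t = -t_max + i * h
        gauss = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        num = (H(gamma_r * t) * H(-gamma * t) * w_s
               + H(-gamma_r * t) * H(gamma * t) * w_t)
        den = H(gamma * t) + w_s * H(-gamma * t)
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * gauss * num / den
    return 2.0 / (1.0 + w_t) * total * h


if __name__ == "__main__":
    # Illustrative inputs only; q and r are not saddle-point solutions.
    print(genuine_error(0.5, 0.6, 0.4, 1.0))
    # Uncorrelated students at very high T_s: the error tends to 1/2.
    print(genuine_error(0.0, 0.0, 1e6, 1.0))
```

For $q = r = 0$ the integrand is constant and the expression reduces to $(e^{-\beta_s} + e^{-\beta_t})/[(1+e^{-\beta_s})(1+e^{-\beta_t})]$, which tends to $1/2$ for $T_s \to \infty$, in line with the limit stated in the text.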

The reason for this ability to produce the 'right' errors can be explained in terms of 
a sort of energy-entropy competition: 

Allowing a given student to make errors there are more possibilities to increase the 
number of errors than to reduce this number; nevertheless, there are many more students 
in the version space classifying the training set with a decreased error number than with 
an increased one. 



5.3 Multifractal phase space analysis 

A more thorough analysis of the phase space structure is possible with the recent 'mul- 
tifractal technique' of Monasson and Zecchina ( flT5| , IfJ). Extending this analysis to 



problems with noise, the distribution of the phase space volumes $V_\tau$, weighted with the
corresponding degree of membership, corresponding to a set $\tau = \{\sigma^\mu\}_{\mu=1,\ldots,p}$ of answers
on the training questions $\xi^\mu$ is given by the quantity

$$g(m) = \frac{1}{mN} \Big\langle\!\Big\langle \ln \sum_\tau \big(e^{-\beta_s E_{tr}(\tau)}\, V_\tau\big)^m \Big\rangle\!\Big\rangle . \qquad (74)$$

Here $m$ is the inverse of a formal temperature and controls how strongly $V_\tau$ is weighted.
Formally $(-g(m))$ is a free energy derived from the partition function

$$Z = \sum_\tau e^{-mE_\tau} \qquad (75)$$

with the energy $E_\tau := -\ln[e^{-\beta_s E_{tr}(\tau)}\, V_\tau]$. Then the quantities

and c{k) ..-_^± (76) 

correspond formally to internal energies and entropies with respect to $T_m := 1/m$, and measure
how the phase-space volume $V_\tau$ and the corresponding number of realizations scale in the
thermodynamic limit, namely as $e^{Nk}$ and $e^{Nc(k)}$, respectively. Here $k < 0$, but $c(k) > 0$.
These quantities can be calculated by a formalism which resembles an RSB1 calculation,
see JT5], [T5|, ||. The dominating behaviour, which is already calculated in the usual RS 



calculation, m = 1 , is noted by the asterisks in Fig. 16 and follows from the identity 
d[c(k) + k]/dk = 0. In this Fig. 16, the results are for the simple perceptron: The overlap 
r(k) is presented as a function of —k, for a = 5 and the teacher output noise temperature 
T t = 1, for different values of T s , namely T s = 1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.16, 0.14, 0.12, 
and 0.1, from the left. In this way, a differentiated picture of changes in the phase-space 
distribution induced by changes of the noise temperature T s can be given. So there 
are two effects: With increasing temperature, regions with high student-teacher overlap 
become more and more active. At the same time, the volume determining the typical 
(dominant) overlap, i.e. the position of the asterisk, moves towards the maximum of 
these overlap curves. After having reached the optimal temperature, the curve decreases 
again, and regions with high training error and small overlap begin to dominate the 
phase space. 



5.4 Why noise is useful: an intuitive picture 

Summarizing the calculations and results of this section, one can give an intuitive picture
of the observed behaviour: In Fig. 17 we plot a scenario corresponding to a slice through
a lake with a flat bank on the left, but a steep shore on the right.

To get the basic point, let us think of the simple perceptron in a regime where replica
symmetry is preserved even for $T_s = 0$, so that the (corrupted) training set is learnable.
The deepest point in the lake corresponds to the error-free solution(s) with respect to
this (corrupted) training set.

An increase of the student temperature $T_s$ corresponds to an increase of the water
level starting from this deepest point. Of course, in the case of Fig. 17 the direction of
the flat bank is favoured compared to the steep shore, so the center of mass shifts to the
left for increasing water level.

This can be transferred to our model:

• The direction of the flat bank (on the left) corresponds to noise-induced flips in
the answers which correct wrong answers in the corrupted training set. The
corresponding phase space of vectors is large (1746 students with a genuine error level,
i.e. with respect to the original rule, reduced from $E_0 = 5$ to $E_0 = 4$ in the example
of section 5.1).

• On the other hand, the steep shore on the right means that the phase space
corresponding to detrimental changes of answers, which are correct in the training
set, should be significantly smaller (1146 students with an enhanced error level,
$E_0 = 5 \to E_0 = 6$).

So, corresponding to the "water dynamics" scenario of Fig. 17, if the noise level is
enhanced, the additional accessible phase space is dominantly situated on the left. The
corresponding flat bank corresponds to just those solutions which correct errors instead
of adding new ones. Since the former are also those having (intuitively) an increased
overlap with the teacher, 'training with noise' can be useful.

But what happens, if T s is increased too much, corresponding to flooding the lake ? 
First of all, we have to notice that the 'optimal' solution is placed somewhere in the 
neighbourhood of the deepest point towards the flat shore. So flooding means that 
many solutions far from the optimum are included, and the performance of a random 
choice (corresponding to the Gibbs algorithm) should decrease. Nevertheless, looking 
at the center of mass (corresponding to the Bayes algorithm), this quantity should be 
less sensitive to the deluge, especially if the shapes of the shores become more similar 
in the remote areas of the lake. So the Bayes algorithm should be less affected by a too 
high student temperature. This lower sensitivity of the Bayes algorithm to the negative 
effect of a very high noise, as contrasted to the case of Gibbsian learning, can be nicely 
observed in Fig. 7. 

6 Conclusions 

We have studied the influence of input- or output noise on the existence of phase trans- 
itions in the generalization behaviour of two-layer neural networks with non-overlapping 
receptive fields. Generally we find for Gibbs learning as a function of the reduced size 
a := p/N of the training set that the 'Aha-effect' phase transitions, where the system 
performs a simple guess for a < a c and only generalizes if a exceeds a critical value, 
persist in the presence of noise. However, the critical parameters scale with the strength 
of the noise, see e.g. Figs. 3 and 4. In particular, the order-index $n$ of the system, which
characterizes the behaviour at the transition, is unchanged: For $n = 1$ (e.g. the committee
machine) the system starts to generalize already with arbitrarily small $\alpha$, while for $n = 2$
(e.g. the $K = 2$ parity machine) there is a continuous 'Aha-effect' transition (i.e. a 2nd
order phase transition), whereas for $n \geq 3$ (e.g. the parity machine with $K \geq 3$, which
has $n = K$) a discontinuous 'Aha-effect' transition happens. Only the so-called 'interim
phase transitions', which appear in some cases for $n = 1$, e.g. the K4_mcode2 machine,
are lost under input noise, but not under output noise. (At the 'interim transitions' only the
strength of the generalization ability is enhanced from an already finite value for $\alpha < \alpha_c$
to a larger value above $\alpha_c$.) Therefore, the 'Aha-effect' transitions are, so to say, more
generic than 'interim transitions'.

Concentrating on the simple perceptron and output noise we also studied the problem 
of an optimal choice for the noise level T s of the student machine, given that of the 
teacher. Looking at Gibbsian learning there is a critical a c above which a student trained 
with a finite noise level has a better performance than a 'perfect student', i.e. a student 
who has learnt the (partially corrupted) training set without errors. For $\alpha \to \infty$ this
optimal noise level approaches $\approx 0.6\,T_t$ (this is true for all network types considered
here). The case is slightly different for Bayesian learning. Here the choice $T_s = T_t$ is
optimal for all a. Nevertheless the learning curve is quite flat around this optimal choice, 
so small deviations from the optimal noise have no large impact. 

Concerning replica symmetry breaking for the problem considered we found that 
for given $T_t$ replica symmetry is typically conserved down to the optimal values of $T_s$
and somewhat below. But if $\alpha$ is larger than a critical noise-dependent value, replica
symmetry breaking occurs for low noise temperatures. Due to this fact, the sub-optimal
overlap $r$ calculated in a replica symmetric theory for this region of small $T_s$ is corrected
to a higher value, which in an RSB1 calculation almost reaches the optimal number.
This means that the 'gain' achieved by training with noise is less pronounced than
predicted in e.g. the RS theory of [11]. Replica symmetry breaking of higher order may
lead to additional (slight) corrections in the region where replica symmetry is broken.
Nevertheless the effect of an optimal finite student noise rate $T_s$ was shown in some
cases not to be an artefact, since the values of $\alpha$ resp. $T_s$, below resp. above which replica
symmetry remains preserved, and also the optimal value of $T_s$ in this RS region, as
calculated in this paper, will remain unchanged.

Finally we took a quick look at mechanisms allowing improved learning for imperfect
students: We showed that error correction is possible, due to a sort of energy-entropy
competition, leading to an increased overlap compared to the minimal-error solutions.

Figure captions

Figure 1. The architecture of the class of networks considered: There are N binary 
input units separated into K different groups leading each to a 'hidden unit'. The 
binary outputs of the 'hidden units' are fed into a final Boolean output function. Only 
the weights w from the inputs to the hidden units can be modified by learning processes. 
Two kinds of noise will be considered below: 'input noise' and 'output noise'. 

Figure 2. The scaling factor $r(\beta)$ of Eqn. (51), which applies to the case $q \to 1$ of
high loading, and the 'intuitive scaling factor' $r^{*} = (1 - 2p_f)^2$, which applies to the limit
$q \to 0$, are presented as a function of the 'corrupted fraction' $2p_f = 2e^{-\beta}/(1 + e^{-\beta})$ of
the training set with respect to the output. $\beta := \beta_s = \beta_t = 1/T_s = 1/T_t$ characterizes
the output noise-levels of the student and the teacher machine.

Figure 3. Data collapsing for output noise: On the left-hand side, the overlap $q(\alpha)$
of the couplings of the student and teacher machines is presented as a function of the
loading $\alpha = p/N$, where $p$ is the number of examples of the training set, for 4 different
values of the common output noise-level $T = 1/\beta$ of the student and the teacher ($T = 0$;
0.5; 1.0 and 1.5), whereas on the right-hand side the results (for $T = 1.0$, 1.5, 2.0 and 5.0)
are presented as a function of $\alpha\beta^2$. Note that for the parity machine with $K = 2$ and 3,
respectively, one has an 'Aha-effect' phase transition of 2nd order ($n = 2$) and 1st order
($n = 3$) respectively, whereas for the committee machine and the K4_mcode2 machine,
where the order index $n = 1$, the machine generalizes right from the beginning. For
the committee machine, there is no phase transition at all, whereas for the K4_mcode2
machine there is an 'interim transition' around $\alpha\beta^2 \approx 16$, a situation which is also
compatible with $n = 1$, see [1].

Figure 4. For the $K = 2$ and $K = 3$ parity machines, and for the $K = 3$ committee,
the prefactor $c$ of the $c/\sqrt{\alpha}$-asymptotics of the error-probability $\epsilon(\alpha)$ is plotted against
$\gamma$ and $p_f$, respectively, for the case of input noise; $\gamma = \beta^{-1}$ is the common noise-level
of teacher and student machines, and $p_f$ is defined in (33) and (34). In the lower plot,
the curves for the $K = 3$ parity and committee machines overlap to the accuracy of the
drawing.

Figure 5. Data collapsing for input noise: For the K = 2 parity machine and for 
the K4_mcode2 machine, with common noise temperatures $\gamma = 1/\beta$ of the teacher and
student machine, ranging from 7 = via 1, 1.5, 2 to 5, the overlap q between teacher and 
student couplings is plotted against $\alpha$ and $\alpha/\zeta^n$, respectively, where $n$ is the order-index
of the system and $\zeta := (1 + \gamma)^{-1}$. For the parity machine, the data collapsing is almost
perfect, whereas for the second machine it applies only to the limits of small and high
values of $\alpha/\zeta$. Note that here, in contrast to Fig. 3, the 'interim transition' of the second
machine is destroyed by the input noise.

Figure 6a. Storage capacity $\alpha_c$ of a deterministic student perceptron (i.e. $K = 1$,
$T_s = 0$) in the presence of a noisy training set ($T_t \neq 0$): A fraction $p_f$ of the binary
answers of the teacher perceptron are misclassified. Both input and output noise are
considered.

Figure 6b. Overlap $r$ of a deterministic student perceptron (i.e. $K = 1$, $T_s = 0$)
with the teacher vector in the presence of a noisy training set with output noise strength
$T_t = 1$, plotted as a function of the reduced size $\alpha := p/N$ of the training set. The solid
line is for Maximal Stability Learning, i.e. the AdaTron algorithm, [10], while the dashed
line is for Gibbs learning. The two overfitting effects appearing here are explained in the
text.

Figure 7. For the perceptron ($K = 1$), the case $T_t = 1$ of the teacher's output
noise-level and the three cases $\alpha = 1$, 2 and 5, the overlaps $q$ and $r$ for Gibbs
learning, and $r_{cp}$ for the central-point network, i.e. Bayesian learning, as explained in
the text, are plotted over the student machine's output noise-level $T_s$. Note that from $r$
an optimal noise temperature $T_s = T_{opt}$ can be defined.

Figure 8. The optimal output noise temperature $T_{\mathrm{opt}}$ of the student machine, as 
determined in Fig. 7, is plotted against $\alpha$ for $T_t = 1$ (Gibbs learning, $K = 1$). The 
dashed line is the asymptotic limit for $\alpha \to \infty$. 

Figure 9. For the perceptron ($K = 1$) with Gibbs learning, for various values of $\alpha$ 
and fixed output noise temperature $T_t = 1$ of the teacher machine, the training error $\epsilon_{tr}$ of 
Eqn. (69) is plotted against the output noise temperature $T_s$ of the student machine. 

Figure 10. For Gibbs learning with $\alpha = 5$, $T_t = 1$ and $K = 1$, the order parameters 
$r_{RS}$ and $q$ (for replica symmetry), and $r_{RSB}$, $q_0$ and $q_1$ (in 1-step replica symmetry 
breaking) are plotted against the student perceptron's output noise temperature $T_s$. For 
values of $T_s$ slightly smaller than the optimal value $T_s \approx 0.35$, replica symmetry 
is broken and the overlap $r$ is somewhat enhanced with respect to the RS case, but 
still smaller than the optimum. 

Figure 11. For $T_t = 1$ and $K = 1$, the optimal overlap $r(T_s = T_{\mathrm{opt}}(\alpha) \mid \alpha)$ is 
presented. For comparison, the overlap obtained for $T_s = 0.1$, where replica symmetry 
is broken, is also plotted, both in RS approximation and in RSB1, where the result is only 
slightly sub-optimal. 

Figure 12. This figure suggests an analogy which makes it plausible that replica symmetry 
is broken for a small noise-level of the student machine (left scenario), but is restored beyond a 
critical value of $T_s$, for a fixed noise-level $T_t$ of the teacher machine (right scenario). 

Figure 13. For a general Boolean function $B$ and the case of output noise, the ratio 
$c/c_{\mathrm{opt}}$ of the prefactors for the asymptotic behaviour $\epsilon(\alpha \to \infty) \to c/\alpha$ is plotted against 
$T_s/T_t$ for the values $T_t = 0.2$, 0.5, 1.0 and 2.0. Note that the optimal value $T_s \approx 0.6\,T_t$ 
is more or less universal, and the behaviour in the vicinity of this point is rather flat. 

Figure 14. Comparison of the overlap $r(\alpha = 2)$ of our theory as a function 
of $T_s$ for $K = 1$ and $T_t = 0.91$ (dashed line, calculated as in Fig. 7) with results from a 
simulation of the states of a small system as described in the text (solid line). In both 
cases there is an optimal output noise strength $T_s$. For more details see the text. 

Figure 15. For a teacher output noise-level of $T_t = 1$ and $\alpha = 2$, the curves for $r$, $\epsilon_0$ 
and $\epsilon_{tr}$ are shown for $K = 1$ as a function of the noise-level $T_s$ of the student machine. 
The increase in $r$ is related to a decrease of $\epsilon_0$, showing the error-correcting behaviour of 
the non-perfect student. 

Figure 16. Phase space analysis according to the formalism of [1], [16] of a perceptron 
(i.e. $K = 1$) with $T_t = 1$ and $\alpha = 5$, i.e. the overlap $r(k)$ is presented as a function 
of the typical volume measure $(-k)$ as explained in the text. The student output noise 
temperatures of the different curves correspond to $T_s = 1.0$, 0.8, 0.6, 0.4, 0.3, 0.2, 0.16, 
0.14, 0.12 and 0.1, from the left. The asterisk denotes the 'typical result' as obtained 
from the usual RS calculation. 

Figure 17. This sketch is intended to make plausible why training with noise can lead to an 
enhancement of the overlap between the couplings of the student and the teacher 
machines. For details see the text. 